Abstract: Being able to detect and recognize human activities
is essential for several applications, including personal
assistive robotics. In this paper, we perform detection and
recognition of unstructured human activity in unstructured
environments. We use an RGBD sensor (Microsoft Kinect) as
the input sensor, and compute a set of features based on
human pose and motion, as well as on image and point-cloud
information. Our algorithm is based on a hierarchical
maximum entropy Markov model (MEMM), which considers a
person's activity as composed of a set of sub-activities. We infer
the two-layered graph structure using a dynamic programming
approach. We test our algorithm on detecting and recognizing
twelve different activities performed by four people in different
environments, such as a kitchen, a living room, and an office,
and achieve good performance even when the person was not
seen before in the training set.

I. Introduction 

Being able to automatically infer the activity that a person 
is performing is essential in many applications, such as in 
personal assistive robotics. For example, if a robot could 
watch and keep track of how often a person drinks water, 
it could prevent dehydration in elderly people by reminding
them to drink. True daily activities do not happen in structured
environments (e.g., with closely controlled background), 
but in uncontrolled and cluttered households and offices. 
Due to its unstructured and often visually confusing nature, 
detection of daily activities becomes a much more difficult 
task. In addition, each person has his or her own habits 
and mannerisms in carrying out tasks, and these variations 
in speed and style create additional difficulties in trying to 
detect and recognize activities. In this work, we are interested 
in reliably detecting daily activities that a person performs in 
a home or office, such as cooking, drinking water, brushing 
teeth, talking on the phone, and so on. 

Most previous work on activity classification has focused 
on using 2D video (e.g., |26l [TOl) or RFID sensors placed 
on humans and objects (e.g., ll4T1l ). The use of 2D videos 
leads to relatively low accuracy (e.g., 78.5% in [19]) even
when there is no clutter. The use of RFID tags is generally
too intrusive because it requires placing RFID tags
on people.

Fig. 1. The RGBD data from the Kinect sensor is used to generate
an articulated skeleton model. This skeleton is used along with the
raw image and depths for estimating the human activity.

In this work, we perform activity detection and recognition
using an inexpensive RGBD sensor (Microsoft Kinect).
Human activities, despite their unstructured nature, tend to
have a natural hierarchical structure; for instance, drinking
water involves a three-step process of bringing a glass to
one's mouth, tilting the glass and head to drink, and putting 
the glass down again. We can capture this hierarchical 
nature using a hierarchical probabilistic graphical model — 
specifically, a two-layered maximum entropy Markov model 
(MEMM). Even with this structured model in place, differ- 
ent people perform tasks at different rates, and any single 
graphical model will likely fail to capture this variation. 
To overcome this problem, we present a method of on-the- 
fly graph structure selection that can automatically adapt 
to variations in task speeds and style. Finally, we need 
features that can capture meaningful characteristics of the 
person. We accomplish this by using the PrimeSense skeleton
tracking system [27] in combination with specially placed
Histogram of Oriented Gradients (HOG) computer vision features.
This approach enables us to achieve reliable performance in 
detection and recognition of common activities performed in 
typical cluttered human environments. 

We evaluated our method on twelve different activi- 
ties (see Figure [3]) performed by four different people 
in five different environments: kitchen, office, bathroom, 
living room and bedroom. Our results show a preci- 
sion/recall of 84.7%/83.2% in detecting the correct activity 
when the person was seen before in the training set and 
67.9%/55.5% when the person was not seen before. We have 
also made the dataset and code available as open source at:
http://pr.cs.cornell.edu/humanactivities

II. Related Work 
There is a large body of previous work on human activity 
recognition. One common approach is to use space-time 
features to model points of interest in video (Him. Several 
authors have supplemented these techniques by adding more 
information to these features | [U gOl SIl l9| |25 , 30 1. How- 
ever, this approach is only capable of classifying, rather than 
detecting, activities. Other approaches include filtering tech- 
niques 1 29 1 and sampling of video patches 1 1 1. Hierarchical 
techniques for activity recognition have been used as well, 
but these typically focus on neurologically-inspired visual 
cortex-type models IS |32l |23l [28l. Often, these authors 



adhere faithfully to the models of the visual cortex, using 
motion-direction sensitive "cells" such as Gabor filters in 
the first layer |[m[26l . 

Another class of techniques used for activity recognition
is that of the hidden Markov model (HMM). Early work by
Brand et al. [2] utilized coupled HMMs to recognize two-handed
activities. Weinland et al. [38] used an HMM together
with a 3D occupancy grid to model human actions.
Martinez-Contreras et al. [21] utilized motion templates together with
HMMs to recognize human activities. In addition to generative
models like HMMs, Lan et al. [14] employed a discriminative
model which was aided by interaction analysis between
people. Sminchisescu et al. [33] used conditional random
fields (CRF) and maximum-entropy Markov models, arguing
that these models overcome some of the limitations presented
by HMMs. Notably, HMMs create long-term dependencies
between observations and try to model observations, which
are already fixed at runtime. On the other hand, MEMMs
and CRFs are able to avoid such dependencies and enable
longer interaction among observations. However, the use of
2D videos leads to relatively low accuracies.

Other authors have worked on hierarchical dynamic 
Bayesian networks. Early work by Wilson and Bobick [39]
extended HMM to parametric HMM for recognizing pointing 
gestures. Fine et al. |0 introduced hierarchical HMM, which 
was later extended by Bui et al. [3] to a general structure in
which each child can have multiple parents. Truyen et al.
[36] then developed a hierarchical semi-Markov CRF that
could be used in partially observable settings. Liao et al.
[18] applied hierarchical CRFs to activity recognition, but
their model requires many GPS traces and is only capable
of off-line classification. Wang et al. [37] proposed Dual
Hierarchical Dirichlet Processes for surveillance of large
areas. Among these, the hierarchical HMM is the
closest model to ours, but it does not capture the idea
that a single state may connect to different parents only for
specified periods of time, as our model does. As a result,
none of these models fits our problem of online detection of
human activities in uncontrolled and cluttered environments.
Since MEMM enables longer interaction among observations,
unlike HMM [33], the hierarchical MEMM allows us to
take new observations and utilize dynamic programming to
consider them in an online setting.

Various robotic systems have used activity recognition 
before. Theodoridis et al. [35] used activity recognition in
robotic systems to discern aggressive activities in humans.
Li et al. [17] discuss the importance of non-verbal communication
between human and robot and developed a method
to recognize simple activities that are nondeterministic in
nature, while other works have focused on developing robots
that utilize activity recognition to imitate human activities
in [^. However, we are more interested here in assistive 
robots. Assistive robots are robots that assist humans in some 
task. Several types of assistive robots exist, including socially 
assistive robots that interact with another person in a non- 
contact manner, and physically assistive robots, which can 
physically help people [34l [24l [161 [HI [H. 



III. Our Approach
We use a supervised learning approach in which we
collected ground-truth labeled data for training our model.
Our input is RGBD images from a Kinect sensor, from which
we extract certain features that are fed as input to our learning
algorithm. We train a two-layered maximum-entropy Markov
model which will capture different properties of human activities,
including their hierarchical nature and the transitions
between sub-activities over time.

A. Features 

We can recognize a person's activity by looking at his or her
current pose and movement over time, as captured by a
set of features. The input sensor for our robot is an RGBD
camera (Kinect) that gives us an RGB image as well as
depths at each pixel. In order to compute the human pose
features, we describe a person by a rigid skeleton that can
move at fifteen joints (see Figure 1). We extract this skeleton
using a tracking system provided by PrimeSense [27]. The
skeleton is described by the length of the links and the joint
angles. Specifically, we have the three-dimensional Euclidean
coordinates of each joint and the orientation matrix of each
joint with respect to the sensor. We compute features from
this data as follows.

Body pose features. The joint orientation is obtained with
respect to the sensor. However, we are interested in true pose,
which is invariant to sensor location. Therefore, we transform
each joint's rotation matrix so that the rotation is given with
respect to the person's torso. For 10 joints, we convert each
rotation matrix to half-space quaternions in order to more
compactly represent the joint's orientation. (A more compact
representation would be to use Euler angles, but they suffer
from a representation problem called gimbal lock [31].) Along
with these joint orientations, we would like to know whether
the person is standing or sitting, and whether or not the person is
leaning over. Such information is observed from the position
of each foot with respect to the torso (3*2) and by using the
head and hip joints to compute the angle of the upper body
against vertical (1). We have 10*4 + 3*2 + 1 = 47 features for
the body pose.
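
As an illustration of the torso-relative pose representation, the following Python sketch converts a joint's sensor-frame rotation matrix into a half-space quaternion. It assumes that each rotation matrix maps the joint's local frame into the sensor frame, so that the relative rotation is the transposed torso matrix times the joint matrix, and it takes "half-space" to mean that the scalar component is kept non-negative. The helper names are hypothetical and are not part of the PrimeSense tracker.

```python
import math

def transpose(m):
    return [[m[r][c] for r in range(3)] for c in range(3)]

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def relative_rotation(torso, joint):
    """Rotation of the joint expressed in the torso frame (assumed convention)."""
    return matmul(transpose(torso), joint)

def rotation_to_quaternion(m):
    """Standard conversion of a 3x3 rotation matrix to a quaternion (w, x, y, z)."""
    trace = m[0][0] + m[1][1] + m[2][2]
    if trace > 0:
        s = 2.0 * math.sqrt(trace + 1.0)
        return (0.25 * s, (m[2][1] - m[1][2]) / s,
                (m[0][2] - m[2][0]) / s, (m[1][0] - m[0][1]) / s)
    if m[0][0] > m[1][1] and m[0][0] > m[2][2]:
        s = 2.0 * math.sqrt(1.0 + m[0][0] - m[1][1] - m[2][2])
        return ((m[2][1] - m[1][2]) / s, 0.25 * s,
                (m[0][1] + m[1][0]) / s, (m[0][2] + m[2][0]) / s)
    if m[1][1] > m[2][2]:
        s = 2.0 * math.sqrt(1.0 + m[1][1] - m[0][0] - m[2][2])
        return ((m[0][2] - m[2][0]) / s, (m[0][1] + m[1][0]) / s,
                0.25 * s, (m[1][2] + m[2][1]) / s)
    s = 2.0 * math.sqrt(1.0 + m[2][2] - m[0][0] - m[1][1])
    return ((m[1][0] - m[0][1]) / s, (m[0][2] + m[2][0]) / s,
            (m[1][2] + m[2][1]) / s, 0.25 * s)

def half_space_quaternion(m):
    """Pick the representative of {q, -q} whose scalar part is non-negative."""
    q = rotation_to_quaternion(m)
    return tuple(-v for v in q) if q[0] < 0 else q

# Feature count stated in the text: 10 joints * 4 quaternion components,
# 2 feet * 3 coordinates, and 1 upper-body angle.
POSE_FEATURE_COUNT = 10 * 4 + 3 * 2 + 1
```

Because a quaternion and its negation describe the same rotation, fixing the sign of the scalar part gives each orientation a single feature vector.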

Hand Position. Hands play an especially important role 
in carrying out many activities, so information about what 
hands are doing can be quite powerful. In particular, we 
want to capture information such as "the left hand is near 
the stomach" or "the right hand is near the right ear." To 
do this, we compute the position of the hands with respect
to the torso, and with respect to the head, in the local
coordinate frame. Though we capture the motion information
as described next, in order to emphasize hand movement, we
also observe hand position over the last 6 frames and record
the highest and lowest vertical hand position. We have
2 * (6 + 2) = 16 features for this.
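
One possible reading of the 2 * (6 + 2) count is sketched below in Python: for each hand, three coordinates relative to the torso plus three relative to the head (6), and the highest and lowest vertical position over the last six frames (2). This decomposition, the choice of the second coordinate as the vertical axis, and the function names are assumptions made for illustration.

```python
def hand_features(hand_track, torso, head):
    """hand_track: list of (x, y, z) hand positions, newest last,
    already expressed in the local coordinate frame (assumed)."""
    recent = hand_track[-6:]                 # last 6 frames
    current = recent[-1]
    rel_torso = [current[k] - torso[k] for k in range(3)]
    rel_head = [current[k] - head[k] for k in range(3)]
    heights = [p[1] for p in recent]         # y is taken as vertical (assumed)
    return rel_torso + rel_head + [max(heights), min(heights)]

def both_hands_features(left_track, right_track, torso, head):
    """Concatenate the 8 features of each hand into 16 features."""
    return (hand_features(left_track, torso, head)
            + hand_features(right_track, torso, head))
```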

Fig. 2. Our two-layered MEMM model.

Motion Information. Motion information is also important
for classifying a person's activities. We select nine frames
spread out over the last three seconds, spaced as follows:
{-5, -9, -14, -20, -27, -35, -44, -54, -65}, where the
numbers refer to the frames chosen relative to the current
frame. Then, we compute the
joint rotations that have occurred between each of these
frames and the current frame, represented as half-space
quaternions (for the 11 joints with orientation information).
This gives 9*11*4 = 396 features. We refer to body pose,
hand and motion features as "skeletal features".
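
The frame offsets listed above follow a simple pattern: the gap between consecutive offsets grows by one frame each time (4, 5, 6, ..., 11). The short Python sketch below regenerates the list and picks the corresponding frames from a history buffer; the buffer layout (newest frame last) and the function names are assumptions for illustration.

```python
MOTION_OFFSETS = [-5, -9, -14, -20, -27, -35, -44, -54, -65]

def generate_offsets(first=-5, first_gap=4, count=9):
    """Rebuild the offset list: each gap is one frame longer than the last."""
    offsets = [first]
    gap = first_gap
    while len(offsets) < count:
        offsets.append(offsets[-1] - gap)
        gap += 1
    return offsets

def select_frames(history, offsets=MOTION_OFFSETS):
    """history[-1] is the current frame; return the frames at the given offsets."""
    if len(history) < 1 - min(offsets):
        raise ValueError("not enough frames in history")
    return [history[-1 + off] for off in offsets]

# 9 frames * 11 joints * 4 quaternion components
MOTION_FEATURE_COUNT = len(MOTION_OFFSETS) * 11 * 4
```

Spacing the frames more densely near the present gives finer resolution for recent motion while still covering older movement with few features.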

Image and point-cloud features. Much useful information 
can be derived directly from the raw image and point cloud 
as well. We use Histogram of Oriented Gradients (HOG)
feature descriptors [4], which give 32 features that count
how often certain gradient orientations are seen in specified 
bounding boxes of an image. Although this computation is 
typically performed on RGB or grayscale images, we can 
also view the depth map as a grayscale image and compute 
the HOG features on that. We have two HOG settings that 
we use. In the "simple HOG" setting, we find the bounding 
box of the person in the image, and compute RGB and 
depth HOG features for that bounding box, for a total of 64 
features. In the "skeletal HOG" setting, we use the extracted 
skeleton model to find the bounding boxes for the person's 
head, torso, left arm, and right arm, and we compute the RGB 
and depth HOG features for each of these four bounding 
boxes, for a total of 256 features. In this paper's primary 
result, we use the "skeletal HOG" setting. 
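
The feature counts given in this section can be tallied as follows. The per-group numbers come from the text; the combined totals are simple sums that the text does not state explicitly, so they should be read as derived figures rather than reported ones.

```python
HOG_FEATURES_PER_BOX = 32      # per bounding box, per channel
CHANNELS = 2                   # RGB image and depth map

simple_hog = 1 * CHANNELS * HOG_FEATURES_PER_BOX      # whole-person box
skeletal_hog = 4 * CHANNELS * HOG_FEATURES_PER_BOX    # head, torso, left arm, right arm

body_pose = 10 * 4 + 3 * 2 + 1
hand_position = 2 * (6 + 2)
motion = 9 * 11 * 4
skeletal = body_pose + hand_position + motion

# Derived total for the primary setting (skeletal features + skeletal HOG).
primary_total = skeletal + skeletal_hog
```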

B. Model Formulation 

Human activity is complex and dynamic, and therefore our 
learning algorithm should model different nuances in human 
activities, such as the following. 

First, an activity comprises a series of sub-activities.
For example, the activity "brushing teeth" consists of sub-activities
such as "squeezing toothpaste," "bringing toothbrush
up to face," "brushing," and so forth. Therefore, for
each activity (represented by $z \in Z$), we will model sub-activities
(represented by $y \in Y$). We will train a hierarchical
Markov model where the sub-activities $y$ are represented by
a layer of hidden variables (see Figure 2).

For each activity, different subjects perform the sub- 
activities for different periods of time. It is not clear how to 
associate the sub-activities to the activities. This implies that 
the graph structure of the model cannot be fixed in advance. 
We therefore determine the connectivity between the z and 
the y layers in the model during inference. 

Model. Our model is based on a maximum-entropy Markov
model (MEMM) [22]. However, in order to incorporate
the hierarchical nature of activities, we use a two-layered
hierarchical structure, as shown in Figure 2.

In our model, let $x^t$ denote the features extracted from
the articulated skeleton model at time frame $t$. Every frame
is connected to high-level activities through the mid-level
sub-activities. Since high-level activities do not change every
frame, we do not index them by time. Rather, we simply
write $z_i$ to denote the $i$-th high-level activity. Activity $i$ occurs
from time $t_{i-1} + 1$ to time $t_i$. Then $\{y^{t_{i-1}+1}, \ldots, y^{t_i}\}$ is the
set of sub-activities connected to activity $z_i$.

C. MEMM with Hierarchical Structure 

As shown in Figure 2, each node $z_i$ in the top layer is
connected to several consecutive nodes in the middle layer
$\{y^{t_{i-1}+1}, \ldots, y^{t_i}\}$, capturing the intuition that a single activity
consists of a number of consecutive sub-activities.

For the sub-activity at each frame $y^t$, we do not know
a priori to which activity $z_i$ it should connect at the top
layer. Therefore, our algorithm must decide when to connect
a middle-layer node $y^t$ to top-layer node $z_i$ and when to
connect it to the next top-layer node $z_{i+1}$. We show in the next
section how selection of graph structure can be done through
dynamic programming. Given the graph structure, our goal
is to infer the $z_i$ that best explains the data. We do this by
modeling the joint distribution $P(z_i, y^{t_{i-1}+1} \cdots y^{t_i} \mid O_i, z_{i-1})$,
where $O_i = x^{t_{i-1}+1}, \ldots, x^{t_i}$, and for each $z_i$, we find the set
of $y^t$'s that maximize the joint probability. Finally, we choose
the $z_i$ that has the highest joint probability.

Learning Model. We use a Gaussian mixture model to 
cluster the original training data into separate clusters, and 
consider each cluster as a sub-activity, rather than manually 
labeling sub-activities for each frame. We constrain the 
model to create five clusters for each activity, and then 
combine all the clusters for a certain location's activities into 
a single set of location-specific clusters. In addition, we also
generate a few clusters from the negative examples, so that
our algorithm is less prone to detecting random activities
as one of the known activities. Specifically, for each classifier
and for each location,
we create a single cluster from each of the activities that do
not occur in that location.

Our model consists of the following three terms: 

• $P(y^t \mid x^t)$: This term models the dependence of the sub-activity
label $y^t$ on the features $x^t$. We model this
using the Gaussian mixture model we have built. The
parameters of the model are estimated from the labeled
training data using maximum likelihood.

• $P(y^{t_i-m} \mid y^{t_i-m-1}, z_i)$ (where $m \in \{0, \ldots, (t_i - t_{i-1} - 1)\}$):
A sequence of sub-activities describes the activities.
For example, we can say the sequence "squeezing
toothpaste," "bringing toothbrush up to face," "actual
brushing," and "putting toothbrush down" describes the
activity "brushing teeth." If we only observe "bringing
toothbrush up to face" and "putting toothbrush down,"
we would not refer to it as "brushing teeth." Unless
the activity goes through a specific set of sub-activities
in nearly the same sequence, it should probably not be
classified as the activity. For all the activities except
neutral, the table is built by observing how the posterior
probabilities of the Gaussian mixture model's soft clusters
transition from frame to frame.



However, it is not so straightforward to build
$P(y^{t_i-m} \mid y^{t_i-m-1}, z_i)$ when $z_i$ is neutral. When a
sub-activity sequence such as "bringing toothbrush to
face" and "putting toothbrush down" occurs, it does
not correspond to any known activity and so is likely
to be neutral. It is not possible to collect data of
all sub-activity sequences that do not occur in our
list of activities, so we rely on the sequences observed
from non-neutral activities. If $N$ denotes the neutral
activity, then $P(y^{t_i-m} \mid y^{t_i-m-1}, z_i = N) \propto 1 - \sum_{z_i \neq N} P(y^{t_i-m} \mid y^{t_i-m-1}, z_i)$.

• $P(z_i \mid z_{i-1})$: The activities evolve over time. For example,
one activity may be more likely to follow another,
and there are brief moments of neutral activity between
two non-neutral activities. Thus, we can make a better
estimate of the activity at the current time if we also
use the estimate of the activity at the previous time step.
Unlike the other terms, because it is difficult to obtain a
data set rich enough for maximum-likelihood estimation, $P(z_i \mid z_{i-1})$
is set manually to capture these intuitions.
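
The sub-activity transition table for a non-neutral activity, described in the second term above, could be estimated along the following lines. This Python sketch is an assumption about how "observing the transition of posterior probability" might be turned into a table: it accumulates the outer product of consecutive soft cluster posteriors and normalizes each row. The function name and the smoothing constant are hypothetical.

```python
def transition_table(posterior_sequences, num_clusters, smoothing=1e-6):
    """Estimate P(y^t = b | y^{t-1} = a, z) for one activity z.

    posterior_sequences: list of sequences; each sequence is a list of
    per-frame soft cluster posteriors (lists of length num_clusters).
    """
    counts = [[smoothing] * num_clusters for _ in range(num_clusters)]
    for seq in posterior_sequences:
        for prev, cur in zip(seq, seq[1:]):
            for a in range(num_clusters):
                for b in range(num_clusters):
                    # Soft count of a transition from cluster a to cluster b.
                    counts[a][b] += prev[a] * cur[b]
    table = []
    for row in counts:
        total = sum(row)
        table.append([v / total for v in row])
    return table
```

With hard (one-hot) posteriors this reduces to ordinary transition counting; soft posteriors spread each observed transition over several cluster pairs.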

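Inference. Consider the two-layer MEMM depicted in Figure 2.
Let a single $z_i$ activity node along with all the $y^t$ sub-activity
nodes connected directly to it and the corresponding
$x^t$ feature inputs be called a substructure of the MEMM
graph. Given an observation sequence $O_i = x^{t_{i-1}+1}, \ldots, x^{t_i}$
and a previous activity $z_{i-1}$, we wish to compute the joint
probability $P(z_i, y^{t_{i-1}+1} \cdots y^{t_i} \mid O_i, z_{i-1})$:

$$
\begin{aligned}
P(z_i, y^{t_{i-1}+1} \cdots y^{t_i} \mid O_i, z_{i-1})
&= P(z_i \mid O_i, z_{i-1}) \, P(y^{t_{i-1}+1} \cdots y^{t_i} \mid z_i, O_i, z_{i-1}) \\
&= P(z_i \mid z_{i-1}) \cdot \prod_{t=t_{i-1}+2}^{t_i} P(y^t \mid y^{t-1}, z_i, x^t)
   \cdot \sum_{y^{t_{i-1}}} P(y^{t_{i-1}+1} \mid y^{t_{i-1}}, z_i, x^{t_{i-1}+1}) \, P(y^{t_{i-1}})
\end{aligned}
$$

We have all of these terms except $P(y^t \mid y^{t-1}, z_i, x^t)$ and
$P(y^{t_{i-1}+1} \mid y^{t_{i-1}}, z_i, x^{t_{i-1}+1})$. Both terms can be derived as

$$
P(y^t \mid y^{t-1}, z_i, x^t) = \frac{P(y^{t-1}, z_i, x^t \mid y^t) \, P(y^t)}{P(y^{t-1}, z_i, x^t)}
$$

We make a naive Bayes conditional independence assumption
that $y^{t-1}$ and $z_i$ are independent from $x^t$ given $y^t$. Using
this assumption, we get:

$$
P(y^t \mid y^{t-1}, z_i, x^t) = \frac{P(y^t \mid y^{t-1}, z_i) \, P(y^t \mid x^t)}{P(y^t)}
$$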



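We have fully derived $P(z_i, y^{t_{i-1}+1} \cdots y^{t_i} \mid O_i, z_{i-1})$:

$$
\begin{aligned}
P(z_i, y^{t_{i-1}+1} \cdots y^{t_i} \mid O_i, z_{i-1}) = {} & P(z_i \mid z_{i-1})
  \cdot \sum_{y^{t_{i-1}}} \frac{P(y^{t_{i-1}+1} \mid y^{t_{i-1}}, z_i) \, P(y^{t_{i-1}+1} \mid x^{t_{i-1}+1})}{P(y^{t_{i-1}+1})} \, P(y^{t_{i-1}}) \\
& \cdot \prod_{t=t_{i-1}+2}^{t_i} \frac{P(y^t \mid y^{t-1}, z_i) \, P(y^t \mid x^t)}{P(y^t)}
\end{aligned}
$$

Note that this formula can be factorized into two terms where
one of them only contains two variables:

$$
P(z_i, y^{t_{i-1}+1} \cdots y^{t_i} \mid O_i, z_{i-1}) = A \cdot \prod_{t=t_{i-1}+2}^{t_i} B(y^{t-1}, y^t)
$$

Here $A$ collects $P(z_i \mid z_{i-1})$ and the boundary sum over $y^{t_{i-1}}$, and
$B(y^{t-1}, y^t) = P(y^t \mid y^{t-1}, z_i) \, P(y^t \mid x^t) / P(y^t)$.

Because the formula has factored into terms containing only
two variables each, this equation can be easily and efficiently
optimized. We simply optimize each factor individually, and
we obtain:

$$
\max P(z_i, y^{t_{i-1}+1} \cdots y^{t_i} \mid O_i, z_{i-1})
= \max_{y^{t_{i-1}+1}} A \cdot \max_{y^{t_{i-1}+2}} B(y^{t_{i-1}+1}, y^{t_{i-1}+2}) \cdots \max_{y^{t_i}} B(y^{t_i-1}, y^{t_i})
$$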
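
To make the maximization over sub-activity sequences concrete, the following Python sketch scores one substructure for a fixed activity $z_i$. It uses a standard Viterbi-style (max-product) pass over the chain, which is one common way to carry out such a maximization and is not necessarily the authors' exact procedure. All argument names are hypothetical: `p_z` stands for $P(z_i \mid z_{i-1})$, `trans[a][b]` for $P(y^t = b \mid y^{t-1} = a, z_i)$, `emit[t][b]` for $P(y^t = b \mid x^t)$ on the frames of the segment, `prior[b]` for $P(y = b)$, and `prev_marginal[a]` for $P(y^{t_{i-1}} = a)$.

```python
def best_substructure_score(p_z, trans, emit, prior, prev_marginal):
    """Maximize A * prod_t B(y^{t-1}, y^t) over the sub-activity sequence."""
    m = len(prior)
    # A term: transition prior times the boundary sum, as a function of
    # the first sub-activity in the segment.
    best = [
        p_z * emit[0][b] / prior[b]
        * sum(trans[a][b] * prev_marginal[a] for a in range(m))
        for b in range(m)
    ]
    # B terms: extend the best partial sequence one frame at a time.
    for t in range(1, len(emit)):
        best = [
            max(best[a] * trans[a][b] for a in range(m)) * emit[t][b] / prior[b]
            for b in range(m)
        ]
    return max(best)
```

Each frame costs on the order of m^2 operations in this sketch, because consecutive B factors share a variable and are therefore maximized jointly rather than one at a time.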



D. Graph Structure Selection 

Now that we can find the set of $y^t$'s that maximize the joint
probability $P(z_i, y^{t_{i-1}+1} \cdots y^{t_i} \mid O_i, z_{i-1})$, the probability of
an activity $z_i$ being associated with the $i$-th substructure and
the previous activity, we wish to use that to compute the
probability of $z_i$ given all observations up to this point.
However, to do this, we must solve the following problem:
for each observation $y^t$, we must decide to which high-level
activity $z_i$ it should be connected (see Figure 2).
For example, consider the last $y$ node associated with the
"drinking water" activity in Figure 2. It is not entirely clear
whether that node really should connect to the "drinking water"
activity, or whether it should connect to the following "neutral"
activity. Deciding with which activity node to associate each
$y$ node is the problem of hierarchical MEMM graph structure
selection.

Unfortunately, we cannot simply try all possible graph
structures. To see why, suppose we have a graph structure at
time $t - 1$ with a final high-level node $z_i$, and then are given
a new node $y^t$. This node has two "choices": it can either
connect to $z_i$, or it can create a new high-level node $z_{i+1}$
and connect to that one. Because every node $y^t$ has this same
choice, if we see a total of $n$ mid-level nodes, then there are
$2^n$ possible graph structures.

We present an efficient method to find the optimal graph
structure using dynamic programming. The method works, in
brief, as follows. When given a new frame for classification,
we try to find the point in time at which the current high-level
activity started. So we pick a time $t'$ and say that every
frame after $t'$ belongs to the current high-level activity. We
have already computed the optimal graph structure for the
first $t'$ time frames, so putting these two subgraphs together
gives us a possible graph structure. We can then use this
graph to compute the probability that the current activity
is $z$. By trying all possible times $t' < t$, we can find the
graph structure that gives us the highest probability, and we
select that as our graph structure at time $t$.

Fig. 3. Samples from our dataset. Row-wise, from left: brushing teeth, cooking (stirring), writing on whiteboard, working on computer,
talking on phone, wearing contact lenses, relaxing on a chair, opening a pill container, drinking water, cooking (chopping), talking on a
chair, and rinsing mouth with water.

The Method of Graph Structure Selection. Now we
describe the method in detail. Suppose we are at some
time $t$; we wish to select the optimal graph structure given
everything we have seen so far. We will define the graph
structure inductively based on graph structures that were
chosen at previous points in time. Let $G_{t'}$ represent the graph
structure that was chosen at some time $t' < t$. Note that, as
a base case, $G_0$ is always the empty graph.

For every $t' < t$, define a candidate graph structure $\tilde{G}_t^{t'}$
consisting of $G_{t'}$ (the graph structure capturing the first $t'$
timeframes), followed by a single substructure from time $t' + 1$
to time $t$ connected to a single high-level node $z_i$. Note
that this candidate graph structure sets $t_{i-1} = t'$ and $t_i = t$.
Given the set of candidate structures $\{\tilde{G}_t^{t'} \mid 1 \le t' < t\}$, the
plan is to find the graph structure and high-level activity $z_i \in Z$
to maximize the likelihood given the set of observations
so far.

Let $O$ be the set of all observations so far. Then
$P(z_i \mid O; \tilde{G}_t^{t'})$ is the probability that the most recent high-level
node $i$ is activity $z_i \in Z$, given all observations so far
and parameterized by the graph structure $\tilde{G}_t^{t'}$. We initially
set $P(z_0 \mid O; G_0)$ to a uniform distribution. Then, through
dynamic programming, we have $P(z_{i-1} \mid O; G_{t'})$ for all $t' < t$
and all $z \in Z$ (details below). Suppose that, at time $t$, we
choose the graph structure $\tilde{G}_t^{t'}$ for a given $t' < t$. Then the
probability that the most recent node $i$ is activity $z_i$ is given
by

$$
\begin{aligned}
P(z_i \mid O; \tilde{G}_t^{t'}) &= \sum_{z_{i-1}} P(z_i, z_{i-1} \mid O; \tilde{G}_t^{t'}) \\
&= \sum_{z_{i-1}} P(z_{i-1} \mid O; \tilde{G}_t^{t'}) \, P(z_i \mid O, z_{i-1}; \tilde{G}_t^{t'}) \\
&= \sum_{z_{i-1}} P(z_{i-1} \mid O; G_{t'}) \, P(z_i \mid O_i, z_{i-1}) \qquad (1)
\end{aligned}
$$

The two factors inside the summation are terms that
we know, the former due to dynamic programming,
and the latter estimated by finding the maximum of
$P(z_i, y^{t_{i-1}+1} \cdots y^{t_i} \mid O_i, z_{i-1})$, described in the previous
section.

Thus, to find the optimal probability of having node $i$ be
a specific activity $z_i$, we simply compute

$$
P(z_i \mid O; G_t) = \max_{t' < t} P(z_i \mid O; \tilde{G}_t^{t'})
$$

We store $P(z_i \mid O; G_t)$ for all $z_i$ for dynamic programming
purposes (Equation 1). Then, to make a prediction of an activity
at time $t$, we compute

$$
\text{activity}^t = \arg\max_{z_i} P(z_i \mid O) = \arg\max_{z_i} \max_{t' < t} P(z_i \mid O; \tilde{G}_t^{t'})
$$
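
The dynamic program described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `segment_prob(z, z_prev, start, end)` is a hypothetical callback standing in for $P(z_i \mid O_i, z_{i-1})$ on frames `start..end` (1-indexed, inclusive), the start time $t' = 0$ is allowed so that the whole sequence can form one substructure, and `max_len` plays the role of a maximum substructure size.

```python
def select_structure(num_frames, activities, segment_prob, max_len=None):
    """best[t][z] approximates P(z_i = z | O; G_t); split[t][z] is the chosen t'."""
    uniform = 1.0 / len(activities)
    best = [{z: uniform for z in activities}]      # empty graph at t = 0
    split = [{z: None for z in activities}]
    for t in range(1, num_frames + 1):
        earliest = 0 if max_len is None else max(0, t - max_len)
        scores, choice = {}, {}
        for z in activities:
            best_val, best_tp = float("-inf"), None
            for tp in range(earliest, t):           # candidate start times t'
                # Sum over the previous activity, reusing stored values.
                val = sum(best[tp][zp] * segment_prob(z, zp, tp + 1, t)
                          for zp in activities)
                if val > best_val:
                    best_val, best_tp = val, tp
            scores[z], choice[z] = best_val, best_tp
        best.append(scores)
        split.append(choice)
    return best, split

# Toy usage with made-up numbers: two frames of "neutral", then two of "drink".
truth = ["neutral", "neutral", "drink", "drink"]

def toy_segment_prob(z, z_prev, start, end):
    p = 0.5                                         # flat activity transition
    for frame in range(start, end + 1):
        p *= 0.9 if truth[frame - 1] == z else 0.1
    return p

best, split = select_structure(4, ["drink", "neutral"], toy_segment_prob)
prediction = max(best[4], key=best[4].get)          # activity at time t = 4
```

In the toy run, the best start time found for "drink" at the last frame is the boundary between the neutral and drinking frames, which is the behavior the graph structure selection is meant to produce.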

Optimality. We show that this algorithm is optimal by
induction on the time $t$. Suppose we know the optimal graph
structure for every time $t' < t$. This is certainly true at time
$t = 1$, as the optimal graph structure at time $t = 0$ is the
empty graph. The optimal graph structure at time $t$ involves
a final high-level node $z_i$ that is connected to $1 \le k \le t$
mid-level nodes.

Suppose the optimal structure at time $t$ has the high-level
node connected to $k = t - t'$ mid-level nodes. Then what
graph structure do we use for the first $t'$ nodes? By the
induction hypothesis, we know the optimal graph structure
$G_{t'}$ for the first $t'$ nodes. That is, $G_{t'}$ is the graph structure
that maximizes the probability $P(z_{i-1} \mid O)$. Because $z_i$ is
conditionally independent of any high-level node before
$z_{i-1}$, the graph structure before $z_{i-1}$ does not affect $z_i$.
Similarly, the graph structure before $z_{i-1}$ obviously does
not depend on the graph structure after $z_{i-1}$. Therefore, the
optimal graph structure at time $t$ is $\tilde{G}_t^{t'}$, the concatenation of
$G_{t'}$ to a single substructure of $t - t'$ nodes.

We do not know what the correct time $0 \le t' < t$ is, but
because we try all, we are guaranteed to find the optimal $t'$
and therefore the optimal graph structure.

Complexity. Let $n$ and $m$ be the number of activities and
sub-activities, respectively, and let $t$ be the time. Space complexity
for the dynamic programming algorithm is $O(n \cdot t)$
since we store a 1-D array of size $t$ for each activity. At each
timeframe, we must compute the optimal graph structure.
By setting a maximum substructure size of $T \ll t$, dynamic
programming requires $n$ activities to be checked for each
of $T$ possible sizes. Each check requires a computation of
$P(z_i, y^{t_{i-1}+1} \cdots y^{t_i} \mid O_i, z_{i-1})$, which takes $O(m \cdot T)$ time.
Thus, each timeframe requires $O(n \cdot m \cdot T^2)$ computation
time. We do this computation for each of $t$ timeframes, for
an overall time complexity of $O(n \cdot m \cdot T^2 \cdot t)$.

IV. Experiments 

Data. We used the Microsoft Kinect sensor, which outputs
an RGB image together with aligned depths at each pixel at
a frame rate of 30 Hz. It produces a 640x480 depth image
with a range of 1.2 m to 3.5 m. The sensor is small enough
for it to be mounted on inexpensive mobile ground robots.

TABLE I
Results of the naive classifier, one-level MEMM model, and our full model in each location. The table shows precision and
recall scores for all of our models. Note that the test dataset contains random movements (in addition to the activities
considered), ranging from a person standing still to walking around while waving his or her hands. RGB(D) HOG refers to
"simple HOG".

"New Person" setting
                                 Naive        One-layer    Full Model   Full Model   Full Model
                                 Classifier   MEMM         RGB HOG      RGBD HOG     Skel.+Skel HOG
Location  Activity               Prec  Rec    Prec  Rec    Prec  Rec    Prec  Rec    Prec  Rec
bathroom  rinsing mouth          77.7  49.3   71.8  63.2   42.2  73.3   49.1  97.3   51.1  51.4
          brushing teeth         64.5  20.5   83.3  57.7   50.7  30.8   73.4  16.6   88.5  55.3
          wearing contact lens   82.0  89.7   81.5  89.7   44.2  40.6   52.5  59.5   78.6  88.3
          Average                74.7  53.1   78.9  70.2   45.7  48.2   58.3  57.8   72.7  65.0

"Have Seen" setting
                                 Naive        One-layer    Full Model
                                 Classifier   MEMM         Skel.+Skel HOG
Location  Activity               Prec  Rec    Prec  Rec    Prec  Rec
bathroom  rinsing mouth          73.3  49.7   70.7  53.1   61.4  70.9
          brushing teeth         81.5  65.1   81.5  75.6   96.7  77.1
          wearing contact lens   87.8  71.9   87.8  71.9   79.2  94.7
          Average                80.9  62.2   80.0

We considered five different environments: office, kitchen,
bedroom, bathroom, and living room. Three to four common
activities were identified for each location, giving a total of
twelve unique activities (see Table I). Data was collected
from four different people: two males and two females.
None of the subjects were otherwise associated with this 
project (and hence were not knowledgeable of our models 
and algorithm). We collected about 45 seconds of data for 
each activity from each person. The data was collected in
different parts of a regular household with no occlusion of
arms and body from the view of the sensor. When collecting,
the subjects were given basic instructions on how to carry
out the activity, such as "stand here and chop this onion,"
but were not given any instructions on how the algorithm
would interpret their movements. (See Figure 3.)

Our goal is to perform human activity detection, i.e., our 
algorithm must be able to distinguish the desired activities 
from other random activities that people perform. To that 
end, we collected random activities by asking the subject 
to act in a manner unlike any of the previously performed 
activities. The random activity contains a sequence of random
movements ranging from a person standing still to a person
walking around and stretching his or her body. Note that
random data was only used for testing.

For testing, we experimented with two settings. In the 
"new person" setting, we employed leave-one-out cross- 
validation to test each person's data; i.e. the model was 
trained on three of the four people from whom data was 
collected, and tested on the fourth. In the other "have seen" 
setting of the experiment, the model was given data about the 
person carrying out the same activity. To achieve this setting, 
we halved the testing subject's data and included one half 
in the training data set. So, even though the model had seen
the person do the activity at least once, it had not seen
the testing data itself.
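
The two evaluation settings can be expressed as data splits, sketched below in Python. The text does not say which half of the test subject's data goes into training in the "have seen" setting, so using the first half is an assumption, as are the function names and the dictionary layout of the recordings.

```python
def new_person_splits(subjects):
    """Leave-one-out: train on all subjects but one, test on the held-out one."""
    for held_out in subjects:
        train = [s for s in subjects if s != held_out]
        yield train, held_out

def have_seen_split(recordings, test_subject):
    """recordings maps a subject to that subject's list of samples.

    Half of the test subject's samples join the training set (first half,
    by assumption); the other half is used for testing.
    """
    train, test = [], []
    for subject, samples in recordings.items():
        if subject == test_subject:
            half = len(samples) // 2
            train.extend(samples[:half])
            test.extend(samples[half:])
        else:
            train.extend(samples)
    return train, test
```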

Finally, to train the model on both left-handed and right-handed
people without needing to film them all, we simply
mirrored the training data across the virtual plane down
the middle of the screen. We have made the data available at:

http://pr.cs.cornell.edu/humanactivities/
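
For the skeleton data, the mirroring step could look like the following Python sketch. It assumes joint positions are stored by name with the x coordinate measured from the sensor's vertical center plane, and that left and right joint labels are swapped after reflection so that a mirrored right-handed motion reads as a left-handed one. Both the data layout and the label swap are assumptions; the text only states that the data was mirrored.

```python
def mirror_skeleton(joints):
    """joints: dict mapping a joint name to an (x, y, z) position."""
    mirrored = {}
    for name, (x, y, z) in joints.items():
        if name.startswith("left_"):
            new_name = "right_" + name[len("left_"):]
        elif name.startswith("right_"):
            new_name = "left_" + name[len("right_"):]
        else:
            new_name = name
        mirrored[new_name] = (-x, y, z)     # reflect across the plane x = 0
    return mirrored
```

The RGB and depth images would need the matching horizontal flip so that the HOG features stay consistent with the mirrored skeleton.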

Models. We compared our two-layered MEMM against two
other models: a naive classifier based on an SVM and a one-level
MEMM. Both models were trained on the full set of features
we have described earlier.

• Baseline: Naive Classifier. As the baseline model, we
used a multi-class support vector machine (SVM)
to map the features directly to the corresponding
high-level activities.

• One-level MEMM. This is a one-level MEMM model
which builds upon the naive classifier. $P(y^t \mid x^t)$ is
computed by fitting a sigmoid function to the output
of the SVM. Transition probabilities between activities,
$P(y^t \mid y^{t-1})$, use the same table we have built for the full
model, which in that model is called $P(z_i \mid z_{i-1})$. Using
$P(y^t \mid x^t)$ and $P(y^t \mid y^{t-1})$, we compute the probability
that the person is engaged in activity $j$ at time $t$.

• Hierarchical MEMM. We ran our full model with a
few different sets of input features in order to show
how much improvement our selection of features brings
compared to a set of features that relies solely on
images. We tried using "simple HOG" features (using a
person's full bounding box) with just RGB image data,
"simple HOG" features with both RGB and depth data,
and skeletal features with the "skeletal HOG" features
for both RGB and depth data.
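
The one-level MEMM baseline described in the list above can be illustrated with the Python sketch below. The text does not give the exact update, so the recursion shown (previous belief propagated through the transition table, multiplied by the per-frame probability, then normalized) is one plausible combination, stated here as an assumption. The sigmoid parameters `a` and `b` are placeholders that would normally be fitted on held-out SVM outputs.

```python
import math

def sigmoid_probability(svm_score, a=-1.0, b=0.0):
    """Map an SVM decision value to a probability-like score.

    The form 1 / (1 + exp(a * score + b)) follows the usual sigmoid fit;
    the default a and b are placeholders, not fitted values.
    """
    return 1.0 / (1.0 + math.exp(a * svm_score + b))

def memm_step(prev_belief, frame_probs, transition):
    """One time step of the assumed one-level MEMM update.

    prev_belief[j]   : probability of activity j at time t-1
    frame_probs[k]   : P(y^t = k | x^t) from the calibrated classifier
    transition[j][k] : P(y^t = k | y^{t-1} = j)
    """
    n = len(frame_probs)
    unnorm = [
        frame_probs[k] * sum(transition[j][k] * prev_belief[j] for j in range(n))
        for k in range(n)
    ]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```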

A. Results and Discussion 

Table I shows the results of the naive classifier, one-
level MEMM and our full two-layered model for the "have
seen" and "new person" settings. The precision and recall 
measures are used as metrics for evaluation. Our model was 
able to detect and classify with a precision/recall measure 
of 84.7%/83.2% and 67.9%/55.5% in "have seen" and "new 
person" settings, respectively. It is not surprising that the
model performs better in the "have seen" setting, as it has 
seen that person's body type and mannerisms before. 
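For reference, per-activity precision and recall can be read off a confusion matrix as sketched below. The convention that rows index the true activity and columns the predicted activity is an assumption of this example, and the sketch does not specify how the per-activity values are averaged into a single figure.

```python
def precision_recall(confusion):
    """Return a list of (precision, recall) pairs, one per class.

    confusion[r][c]: number of frames of true class r predicted as
    class c (rows = true, columns = predicted; assumed convention).
    """
    n = len(confusion)
    results = []
    for k in range(n):
        true_pos = confusion[k][k]
        predicted_k = sum(confusion[r][k] for r in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        results.append((precision, recall))
    return results
```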

We found that both the naive classifier and one-level
MEMM were able to classify well when a frame contained
distinct characteristics of an activity, but performed poorly
when characteristics were subtler. The one-level MEMM
was able to perform better than the naive classifier, as it
naturally captures important temporal properties of motion.
Our full two-layered MEMM, however, is able to capture the
hierarchical nature of human activities in a way that neither
the naive classifier nor the one-level MEMM can. As a
result, it performed the best of all three models.

The comparison of feature sets on our full model shows
that the features we use are considerably more robust than
the "simple HOG" features computed on RGB and/or depth
images alone.

In the "have seen" setting, the HOG features on RGB images
are capable of capturing powerful information about a person.
However, when seeing a new person, changes in clothing
and background can cause confusion, especially in
uncontrolled and cluttered environments, as shown by the
relatively low precision/recall value of 33.1%/23.5%. The
skeletal features along with HOG on depth, while sometimes
less informative than the HOG on images, are both more
robust to changes in people. Thus, by combining skeletal
features, skeletal HOG image features, and skeletal HOG
depth features, we simultaneously achieved good performance
in the "new person" setting and very good performance in the
"have seen" setting.

Figure 4 and Figure 5 show the confusion matrices between
the activities in the "new person" and "have seen" settings
when using skeletal features and "skeletal HOG" image and
depth features. When the model did not classify correctly, it
usually chose the neutral activity, which is typically not as
bad as choosing a wrong "active" activity. The confusion
matrices also show that many of the mistakes are reasonable,
in that the algorithm confuses an activity with a very similar
one. For example, cooking-chopping and cooking-stirring are
often confused, rinsing mouth with water is confused with
brushing teeth, and talking on the couch is confused with
relaxing on the couch.

Another strength of our model is that it correctly classifies 
random data as neutral most of the time, as shown in the 
bottom row of the confusion matrices. This means that it
is able to distinguish whether or not one of the provided
activities is actually occurring; thus our algorithm is not
likely to misfire when a person is doing some new activity that
the algorithm has not seen before. Also, since we trained on both
the regular and mirrored data, the model performs well with 
both left- and right-handed people. 

However, there are some limitations to our method. First, 
our data only included cases in which the person was not 
occluded by an object; our method does not model occlusions 
and may not be robust to such situations. Second, some
activities require contextual information beyond human
pose alone. For example, knowledge of the objects being
used could significantly improve human activity
recognition algorithms in the future.

V. Conclusion 
In this paper, we considered the problem of detecting and 
recognizing activities that humans perform in unstructured 
environments such as homes and offices. We used an inex- 
pensive RGBD sensor (Microsoft Kinect) as the input sensor, 
the low cost of which enables our approach to be useful 
for applications such as smart homes and personal assis- 
tant robots. We presented a two-layered maximum entropy 
Markov model (MEMM). This MEMM modeled different 
properties of the human activities, including their hierar- 
chical nature, the transitions between sub-activities over 
time, and the relation between sub-activities and different 
types of features. During inference, our algorithm exploited 
the hierarchical nature of human activities to determine 
the best MEMM graph structure. We tested our algorithm 
extensively on twelve different activities performed by four 
different people in five different environments, where the 
test activities were often interleaved with random activities 
not belonging to these twelve categories. It achieved good
detection performance both when the person had been seen
in the training set and when the person had not.